ABSTRACT

Research efforts to model speech perception in terms
of a processing system in which knowledge and processing are
distributed over large numbers of highly interactive (but
computationally primitive) elements are described in this report.
After discussing the properties of speech that demand a parallel 
interactive processing system, the report reviews both 
psycholinguistic and machine-based attempts to model speech 
perception. It then presents the results of a computer simulation of 
one version of an interactive activation model of speech, based 
loosely on the COHORT model, devised by W. D. Marslen-Wilson and 
Welsh (1978), which is capable of word recognition and phonemic 
restoration without depending on preliminary segmentation of the 
input into phonemes. The report then notes the deficiencies of this
model, among them its excessive sensitivity to speech rate and its 
dependence on accurate information about word beginnings. It also 
describes the TRACE model, which is designed to address these 
deficiencies, noting that it allows interactive activation processes 
to take place within a structure that serves as a dynamic working 
memory. The report points out that this structure permits the model 
to capture contextual influence in which the perception of a portion 
of the input stream is influenced by what follows it as well as by 
what precedes it in the speech signal. (FL)

Researchers who have attempted to understand higher-level mental
processes have often assumed that an appropriate analogy to the organization of
these processes in the human mind was the high-speed digital computer. However,
it is a striking fact that computers are virtually incapable of handling the
routine mental feats of perception, language comprehension, and memory retrieval
which we as humans take so much for granted. This difficulty is especially
apparent in the case of machine-based speech recognition systems.

Recently a new way of thinking about the kind of processing system in which
these processes take place has begun to attract the attention of a number of
investigators. Instead of thinking of the cognitive system as a single high-speed
processor capable of arbitrarily complex sequences of operations, scientists in
many branches of cognitive science are beginning to think in terms of alternative
approaches. Although the details vary from model to model, these models usually
assume that information processing takes place in a system containing very large
numbers of highly interconnected units, each of about the order of complexity of
a neuron. That is, each unit accumulates excitatory and inhibitory inputs from
other units and sends such signals to others on the basis of a fairly simple
(though usually non-linear) function of its inputs, and adjusts its
interconnections with other units to be more or less responsive to particular
inputs in the future. Such models may be called interactive activation models
because processing takes place in them through the interaction of large numbers
of units of varying degrees of activation. In such a system, a representation is
a pattern of activity distributed over the units in the system and the pattern of
strengths of the interconnections between the units. Processing amounts to the
unfolding of such a representation in time through excitatory and inhibitory
interactions and changes in the strengths of the interconnections. The
interactive activation model of reading (McClelland and Rumelhart, 1981;
Rumelhart and McClelland, 1982) is one example of this approach; a thorough
survey of recent developments in this field is available in Hinton and Anderson
(1981).
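
As a rough illustration only, the kind of unit described above can be sketched
in a few lines of Python. The update rule, the parameter values, the external
input, and the names step, weights, and external are all assumptions made for
this sketch; they are not taken from the models discussed in this chapter, and
the adjustment of connection strengths is left out entirely.

```python
def step(acts, weights, external, decay=0.1, rest=0.0, a_min=-0.2, a_max=1.0):
    """Advance every unit by one time step and return the new activations.

    weights[j][i] is the (assumed) connection strength from unit j to unit i:
    positive values are excitatory, negative values inhibitory.
    """
    new_acts = []
    for i, a in enumerate(acts):
        # Only units with positive activation send signals to other units.
        net = external[i] + sum(
            weights[j][i] * max(acts[j], 0.0)
            for j in range(len(acts)) if j != i
        )
        if net > 0:
            delta = net * (a_max - a)   # excitation pushes toward the ceiling
        else:
            delta = net * (a - a_min)   # inhibition pushes toward the floor
        delta -= decay * (a - rest)     # activation decays toward resting level
        new_acts.append(min(a_max, max(a_min, a + delta)))
    return new_acts


# Two mutually inhibitory units; unit 0 receives slightly stronger input.
weights = [[0.0, -0.5], [-0.5, 0.0]]
external = [0.3, 0.2]
acts = [0.0, 0.0]
for _ in range(50):
    acts = step(acts, weights, external)
print(acts)
```

Because the change in activation is scaled by the distance to the ceiling or
floor, the function is non-linear in its inputs, and the unit with the stronger
input ends up suppressing its competitor.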

In this chapter we will discuss research currently in progress in our
laboratory at the University of California, San Diego. The goal of this work is
to model speech perception as an interactive activation process. Research over
the past several decades has made it abundantly clear that the speech signal is
extremely complex and rich in detail. It is also clear from perceptual studies
that human listeners appear able to deal with this complexity and to attend to
the detail in ways which are difficult to account for using traditional
approaches. It is our belief that interactive activation models may provide
exactly the sort of computational framework which is needed to perceive speech.
While we make no claims about the neural basis for our model, we do feel that
the model is far more consistent with what is known about the functional
neurophysiology of the human brain than is the von Neumann machine.

The chapter is organized in the following manner. We begin by reviewing
relevant facts about speech acoustics and speech perception. Our purpose is to
demonstrate the nature of the problem. We then consider several previous
attempts to model the perception of speech, and argue that these attempts, when
they are considered in any detail, fail to account for the observed phenomena.
Next we turn to our modeling efforts. We describe an early version of the model,
and present the results of several studies involving a computer simulation of
the model. Then, we consider shortcomings of this version of the model. Finally,
we describe an alternative formulation which is currently being developed.

THE PROBLEM OF SPEECH PERCEPTION



There has been a great deal of research on the perception of speech over 
the past several decades. This research has succeeded in demonstrating the 
magnitude of the problem facing any attempt to model the process by which 
humans perceive speech. At the same time, important clues about the nature of
the process have been revealed. In this section we review these two aspects of 
what has been learned about the problem. 



Why Speech Perception is Difficult 

The segmentation problem. There has been considerable debate about
what the 'units' of speech perception are. Various researchers have advanced
arguments in favor of diphones (Klatt, 1980), phonemes (Pisoni, 1981),
demisyllables (Fujimura & Lovins, 1978), context-sensitive allophones
(Wickelgren, 1969), and syllables (Studdert-Kennedy, 1976), among others, as
basic units in perception. Regardless of which of these proposals one favors, it
nonetheless seems clear that at various levels of processing there exist some
kind(s) of unit which have been extracted from the speech signal. (This
conclusion appears necessary if one assumes a generative capacity in speech
perception.) It is therefore usually assumed that an important and appropriate
task for speech analysis is somehow to segment the speech input, that is, to
draw lines separating the units.

The problem is that whatever the units of perception are, their boundaries
are rarely evident in the signal (Zue & Schwartz, 1980). The information which
specifies a particular phoneme is "encoded" in a stretch of speech much larger
than that which we would normally say actually represents the phoneme
(Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967). It may be impossible
to say where one phoneme (or demisyllable, or word, etc.) ends and the next
begins.

As a consequence, most systems begin to process an utterance by attempting
what is usually an extremely errorful task. These errors give rise to further
errors at later stages. A number of strategies have evolved with the sole purpose
of recovering from initial mistakes in segmentation (e.g., the "segment lattice"
approach adopted by BBN's HWIM system, Bolt, Beranek, & Newman, 1976).

We also feel that there are units of speech perception. However, it is our 
belief that an adequate model of speech perception will be able to accomplish the 
apparently paradoxical task of retrieving these units without ever explicitly seg- 
menting the input. 



Coarticulatory effects. The production of a given sound is greatly affected
by the sounds which surround it. This phenomenon is termed coarticulation. As an
example, consider the manner in which the velar stop [g] is produced in the words
gap vs. geese. In the latter word, the place of oral closure is moved forward
along the velum in anticipation of the front vowel [i]. Similar effects have been
noted for anticipatory rounding (compare the [s] in stew with the [s] in steal),
for nasalization (e.g., the [a] in can't vs. cat), and for velarization (e.g.,
the [n] in tank vs. tenth), to name but a few. Coarticulation can also result in
the addition of sounds (consider the intrusive [t] in the pronunciation of tense
as [tents]).

We have already noted how coarticulation may make it difficult to locate
boundaries between segments. Another problem arises as well. This high degree
of context-dependence renders the acoustic correlates of speech sounds highly
variable. Remarkably, listeners rarely misperceive speech in the way we might
expect from this variability. Instead they seem able to adjust their perceptions
to compensate for context. Thus, researchers have routinely found that listeners
compensate for coarticulatory effects. A few examples of this phenomenon follow:

* There is a tendency in the production of vowels for speakers to "undershoot"
the target formant frequencies for the vowel (Lindblom, 1963). Thus, the
possibility arises that the same formant pattern may signal one vowel in the
context of a bilabial consonant and another vowel in the context of a palatal.
Listeners have been found to adjust their perceptions accordingly such that
their perception correlates with an extrapolated formant target, rather than the
formant values actually attained (Lindblom & Studdert-Kennedy, 1967). Oddly, it
has been reported that vowels in such contexts are perceived even more
accurately than vowels in isolation (Strange, Verbrugge, & Shankweiler, 1976;
Verbrugge, Shankweiler, & Fowler, 1976).

* The distinction between [s] and [ʃ] is based in part on the frequency spectrum
of the frication (Harris, 1968; Strevens, 1960), such that when energy is
concentrated in regions above 4 kHz an [s] is heard. When there is considerable
energy below this boundary, an [ʃ] is heard. However, it is possible for the
spectra of both these fricatives to be lowered due to coarticulation with a
following rounded vowel. When this occurs, the perceptual boundary appears to
shift. Thus, the same spectrum will be perceived as an [s] in one case, and as
an [ʃ] in the other, depending on which vowel follows (Mann & Repp, 1980). A
preceding vowel has a similar though smaller effect (Hasegawa, 1976).

* Öhman (1966) has demonstrated instances of vowel coarticulation across a
consonant. (That is, where the formant trajectories of the first vowel in a VCV
sequence are affected by the non-adjacent second vowel, despite the intervention
of a consonant.) In a series of experiments in which such stimuli were
cross-spliced, Martin and Bunnell (1981) were able to show that listeners are
sensitive to such distal coarticulatory effects.

* Repp and Mann (1981a, 1981b) have reported generally higher F3 and F4 onset
frequencies for stops following [s] as compared with stops which follow [ʃ].
Parallel perceptual studies revealed that listeners' perceptions varied in a way
which was consistent with such coarticulatory influences.

* The identical burst of noise can cue perception of stops at different places of
articulation. A noise burst centered at 1440 Hz followed by steady state
formants appropriate to the vowels [i], [a], or [u] will be perceived as [p],
[k], or [p], respectively (Liberman, Delattre, & Cooper, 1952). Presumably this
reflects the manner in which the vocal tract resonances which give rise to the
stop burst are affected during production by the following vowel (Zue, 1976).

* The formant transitions of stop consonants vary with preceding liquids ([r] and
[l]) in a way which is compensated for by listeners' perceptions (Mann, 1980).
Given a sound which is intermediate between [g] and [d], listeners are more
likely to report hearing a [g] when it is preceded by [l] than by [r].



In the above examples, it is hard to be sure what the nature of the relation
is between production and perception. Are listeners accommodating their
perception to production dependencies? Or do speakers modify production to take
into account peculiarities of the perceptual system? Whatever the answer, both
the production and the perception of speech involve complex interactions, and
these interactions tend to be mirrored in the other modality.

Feature dependencies. We have just seen that the manner in which a
feature or segment is interpreted frequently depends on the sounds which
surround it; this is what Jakobson (1968) would have called a syntagmatic
relation. Another factor which must be taken into consideration in analyzing
features is what other features co-occur in the same segment. Features may be
realized in different ways, depending on what other features are present.

If a speaker is asked to produce two vowels with equal duration, amplitude,
and fundamental frequency (F0), and one has a low tongue position (such as [a])
and the other has a high tongue position (e.g., [i]), the [a] will generally be
longer, louder, and have a lower F0 than the [i] (Peterson & Barney, 1962). This
production dependency is mirrored by listeners' perceptual behavior. Despite
physical differences in duration, amplitude, and F0, the vowels produced in the
above manner are perceived as identical with regard to these dimensions (Chuang
& Wang, 1978). Another example of such an effect may be found in the
relationship between the place of articulation and voicing of a stop. The
perceptual threshold for voicing shifts along the VOT continuum as a function of
place, mirroring a change which occurs in production.

In both these examples, the interaction is between feature and intra- 
segmental context, rather than between feature and trans-segmental context. 



Trading relations. A single articulatory event may give rise to multiple
acoustic cues. This is the case with voicing in initial stops. In articulatory
terms, voicing is indicated by the magnitude of voice onset time (VOT). VOT
refers to the temporal offset between onset of glottal pulsing and the release
of the stop. This apparently simple event has complex acoustic consequences.
Among other cues, the following provide evidence for the VOT: (1) presence or
absence of first formant (F1 cutback), (2) voiced transition duration, (3) onset
frequency of F1, (4) amplitude of burst, and (5) F0 onset contour. Lisker (1957,
1978) has provided an even more extensive catalogue of cues which are available
for determining the voicing of stops in intervocalic position.

In cases such as the above, where multiple cues are associated with a
phonetic distinction, these cues exhibit what have been called "trading relations"
(see Repp, 1981, for review). Presence of one of the cues in greater strength
may compensate for absence or weakness of another cue. Such perceptual
dependencies have been noted for the cues which signal place and manner of
articulation in stops (Miller & Eimas, 1977; Oden & Massaro, 1978; Massaro &
Oden, 1980a,b; Alfonso, 1981), voicing in fricatives (Derr & Massaro, 1980;
Massaro & Cohen, 1976), and the fricative/affricate distinction (Repp, Liberman,
Eccardt, & Pesetsky, 1978), among many others.

As is the case with contextually governed dependencies, the net effect of
trading relations is that the value of a given cue cannot be known absolutely.
The listener must integrate across all the cues which are available to signal a
phonetic distinction; the significance of any given cue interacts with the other
cues which are present.
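
A toy calculation may make the notion of a trading relation concrete. The
sketch below combines two of the voicing cues listed above, VOT and F1 onset
frequency, by simple addition before a logistic decision. The additive form,
the weights, and the function name p_voiceless are assumptions chosen purely
for illustration; they are not estimates from any of the studies cited.

```python
import math


def p_voiceless(vot_ms, f1_onset_hz, w_vot=0.25, w_f1=0.02, bias=-13.5):
    """Probability of a 'voiceless' response given two cues (toy weights)."""
    # Each cue contributes evidence; the contributions simply add up.
    evidence = w_vot * vot_ms + w_f1 * f1_onset_hz + bias
    return 1.0 / (1.0 + math.exp(-evidence))


# With these made-up weights, 8 ms less VOT is offset by a 100 Hz higher
# F1 onset, so the two stimuli are equally ambiguous.
print(p_voiceless(30, 300))
print(p_voiceless(22, 400))
```

The point of the sketch is only that no single cue value fixes the percept: the
same VOT can fall on either side of the boundary depending on the other cue.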

Rate dependencies. The rate of speech normally may vary over the duration
of a single utterance, as well as across utterances. The changes in rate affect
the dynamics of the speech signal in a complex manner. In general, speech is
compressed at higher rates of speech, but some segments (vowels, for example)
are compressed relatively more than others (stops). Furthermore, the boundaries
between phonetic distinctions may change as a function of rate (see Miller, 1981
for an excellent review of this literature).

One of the cues which distinguishes the stop in [ba] from the glide in [wa] is
the duration of the consonantal transition. At a medium rate of speech a transition
of less than approximately 60 ms causes listeners to perceive stops (Liberman,
Delattre, Gerstman, & Cooper, 1966). Longer durations signal glides (but at very
long durations the transitions indicate a vowel). The location of this boundary is
affected by rate changes; it shifts to shorter values at faster rates (Minifie,
Kuhl, & Stecher, 1976; Miller & Liberman, 1979).

A large number of other important distinctions are affected by the rate of
speech. These include voicing (Summerfield, 1974), vowel quality (Lindblom &
Studdert-Kennedy, 1967; Verbrugge & Shankweiler, 1977), and fricative vs.
affricate (although these findings are somewhat paradoxical; Dorman, Raphael, &
Liberman, 1976).



Phonological effects. In addition to the above sources of variability in the
speech signal, consider the following phenomena.

In English, voiceless stop consonants are produced with aspiration in
syllable-initial position (as in [pʰ]) but not when they follow an [s] (as in
[sp]). In many environments, a sequence of an alveolar stop followed by a palatal
glide is replaced by an alveolar palatal affricate, so that did you is pronounced
as [dɪdʒu]. Also in many dialects of American (but not British) English,
voiceless alveolar stops are 'flapped' intervocalically following a stressed
vowel (pretty being pronounced as [prɪDi]). Some phonological processes may
delete segments or even entire syllables; vowels in unstressed syllables may thus
be either "reduced" or deleted altogether, as in policemen [plismən].

The above examples illustrate phonological processes. These operate when
certain sounds appear in specific environments. In many respects, they look like
the contextually governed and coarticulatory effects described above (and at
times the distinction is in fact not clear). Phonological changes are relatively
high-level. That is, they are often (although not always) under speaker control.
The pronunciation of pretty as [prɪDi] is typical of rapid conversational speech,
but if a speaker is asked to pronounce the word very slowly, emphasizing the
separate syllables, he or she will say [prɪ-tʰi]. Many times these processes are
entirely optional; this is generally the case with deletion rules. Other
phonological rules (e.g., allophonic rules) are usually obligatory. This is true
of syllable-initial voiceless stop aspiration.

Phonological rules vary across languages and even across dialects and 
speech styles of the same language. They represent an important source of 
knowledge listeners have about their language. It is clear that the successful 
perception of speech relies heavily on phonological knowledge. 



* * * * *



These are but a few of the difficulties which are presented to speech
perceivers. It should be evident that the task of the listener is far from
trivial. There are several points which are worth making explicit before
proceeding.

First, the observations above lead us to the following generalization. There 
are an extremely large number of factors which converge during the production of 
speech. These factors interact in complex ways. Any given sound can be con- 
sidered to lie at the nexus of these factors, and to reflect their interaction. The 
process of perception must somehow be adapted to unraveling these interactions.

Second, as variable as the speech signal is, that variability is lawful. Some
models of speech perception and most speech recognition systems tend to view
the speech signal as a highly degraded input with a low signal/noise ratio. This
is an unfortunate conclusion. The variability is more properly regarded as the
result of the parallel transmission of information. This parallel transmission
provides a high degree of redundancy. The signal is accordingly complex, but, if
it is analyzed correctly, it is also extremely robust. This leads to the next
conclusion.

Third, rather than searching for acoustic invariance (either through
reanalysis of the signal or proliferation of context-sensitive units) we might do
better to look for ways in which to take advantage of the rule-governed
variability. We maintain that the difficulty which speech perception presents is
not how to reconstruct an impoverished signal; it is how to cope with the
tremendous amount of information which is available, but which is (to use the
term proposed by Liberman et al., 1967) highly encoded. The problem is lack of a
suitable computational framework.



Clues About the Nature of the Process 

The facts reviewed above provide important constraints on models of speech
perception. That is, any successful model will need to account for those
phenomena in an explicit way. In addition, the following facts should be
accounted for in any model of speech perception.

High-level knowledge interacts with low-level decisions. Decisions about
the acoustic/phonetic identity of segments are usually considered to be
low-level. Questions such as "What clause does this word belong to?" or "What
are the pragmatic properties of this utterance?" are thought of as high-level.
In many other models of speech perception, these decisions are answered at
separate stages in the process, and these stages interact minimally and often
only indirectly; at best, the interactions are bottom-up. Acoustic/phonetic
decisions may supply information for determining word identity, but word
identification has little to do with acoustic/phonetic processing.

We know now, however, that speech perception involves extensive interac- 
tions between levels of processing, and that top-down effects are as significant 
as bottom-up effects. 

For instance, Ganong (1980) has demonstrated that the lexical identity of a
stimulus can affect the decision about whether a stop consonant is voiced or
voiceless. Ganong found that, given a continuum of stimuli which ranged
perceptually from gift to kift, the voiced/voiceless boundary of his subjects
was displaced toward the voiced end, compared with similar decisions involving
stimuli along a giss - kiss continuum. The low-level decision regarding voicing
thus interacted with the high-level lexical decision.

In a similar vein, Isenberg, Walker, Ryder, & Schweickert (1980) found that
the perception of a consonant as being a stop or a fricative interacted with
pragmatic aspects of the sentence in which it occurred. In one of the
experiments reported by Isenberg et al., subjects heard two sentence frames:
I like ___ joke and I like ___ drive. The target slot contained a stimulus which
was drawn from a to - the continuum (actually realized as [tə] - [ðə], with
successive attenuation of the amplitude of the burst + aspiration interval
cueing the stop/fricative distinction). For both frames, "to" as well as "the"
result in grammatical sentences. However, joke is more often used as a noun,
whereas drive occurs more often as a verb. Listeners tended to hear the
consonant in the way which favored the pragmatically plausible interpretation of
the utterance. This was reflected as a shift in the phoneme boundary toward the
[t] end of the continuum for the I like ___ joke items, and toward the [ð] end
for the I like ___ drive items.

The role of phonological knowledge in perception has been illustrated in an
experiment by Massaro and Cohen (1980). Listeners were asked to identify
sounds from a [li]-[ri] continuum (where stimuli differed as to the onset
frequency of F3). The syllables were placed after each of four different
consonants; some of the resulting sequences were phonotactically permissible in
English but others were not. Massaro and Cohen found that the boundary between
[l] and [r] varied as a function of the preceding consonant. Listeners tended to
perceive [l], for example, when it was preceded by an [s], since [#sl] is a legal
sequence in English but [#sr] is not. On the other hand, [r] was favored over [l]
when it followed [t], since English permits [#tr] but not [#tl].

Syntactic decisions also interact with acoustic/phonetic processes. Cooper
and his colleagues (Cooper, 1980; Cooper, Paccia, & Lapointe, 1978; Cooper &
Paccia-Cooper, 1980) have reported a number of instances in which rather subtle
aspects of the speech signal appear to be affected by syntactic properties of
the utterance. These include adjustments in the fundamental frequency, duration,
and the blocking of phonological rules across certain syntactic boundaries. While
these studies are concerned primarily with aspects of production, we might
surmise from previous cases where perception mirrors production that listeners
take advantage of such cues in perceiving speech.

Not only the accuracy, but also the speed of making low-level decisions
about speech, is influenced by higher-level factors. Experimental support for this
view is provided by data reported by Marslen-Wilson and Welsh (1978). In their
study subjects were asked to shadow various types of sentences. Some of the
utterances consisted of syntactically and semantically well-formed sentences.
Other utterances were syntactically correct but semantically anomalous. A third
class of utterances was both syntactically and semantically ungrammatical.
Marslen-Wilson and Welsh found that shadowing latencies varied with the type of
utterance. Subjects shadowed the syntactically and semantically well-formed
prose most quickly. Syntactically correct but meaningless utterances were
shadowed less well. Random sequences of words were shadowed most poorly of all.
These results indicate that even when acoustic/phonetic analysis is possible in
the absence of higher-level information, this analysis (at least as required for
purposes of shadowing) seems to be aided by syntactic and semantic support.

A final example of how high-level knowledge interacts with low-level
decisions comes from a study by Elman, Diehl, & Buchwald (1977). This study
illustrates how phonetic categorization depends on language context ("What
language am I listening to?"). Elman et al. constructed stimulus tapes which
contained a number of naturally produced one-syllable items which followed a
precursor sentence. Among the items were the nonsense syllables [ba] or [pa],
chosen so that several syllables had stop VOT values ranging from 0 ms to 40 ms
(in addition to others with more extreme values).

Two tapes were prepared and presented to subjects who were bilingual in
Spanish and English. On one of the tapes, the precursor sentence was "Write the
word..."; the other tape contained the Spanish translation of the same sentence.
Both tapes contained the same [ba] and [pa] nonsense stimuli. Subjects listened
to both tapes; the Spanish tape was heard in a context in which all experimental
materials and instructions were in Spanish, and the English tape was heard in an
English context.

The result was that subjects' perceptions of the same [ba]/[pa] stimuli
varied as a function of context. In the Spanish condition, the phoneme boundary
was located in a region appropriate to Spanish (i.e., near 0 ms) while in the
English condition the boundary was correct for English (near 30 ms).

One of the useful lessons of this experiment comes from a comparison of the
results with previous attempts to induce perceptual shifts in bilinguals. Earlier
studies had failed to obtain such language-dependent shifts in phoneme boundary
(even though bilinguals have been found to exhibit such shifts in production).
Elman et al. suggested that the previous failures were due to inadequate
procedures for establishing language context. These included a mismatch between
context (natural speech) and experimental stimuli (synthetic speech). Contextual
variables may be potent forces in perception, but the conditions under which the
interactions occur may also be very precisely and narrowly defined.

Reliance on lexical constraints. Even in the absence of syntactic or
semantic structure, lexical constraints exert a powerful influence on
perception; words are more perceptible than nonwords (Rubin, Turvey, &
VanGelder, 1976). Indeed, this word advantage is so strong that listeners may
even perceive missing phonemes as present, provided the result yields a real
word (Warren, 1970; Samuel, 1979). Samuel (1980) has shown that if a missing
phoneme could be restored in several ways (e.g., le_ion could be restored either
as legion or lesion), then restoration does not occur.

Speech perception occurs rapidly and in one pass. In our view, an extremely
important fact about human speech perception is that it occurs in one pass and in
real time. Marslen-Wilson (1975) has shown that speakers are able to shadow
(repeat) prose at very short latencies (e.g., 250 ms, roughly equal to a
one-syllable delay). In many cases, listeners are able to recognize and begin
producing a word before it has been completed. This is especially true once a
portion of a word has been heard which is sufficient to uniquely determine the
identity of the word. This ability of humans to process in real time stands in
stark contrast to machine-based recognition systems.

Context effects get stronger toward the ends of words. Word endings appear
to be more susceptible to top-down effects than word beginnings. Put differently,
listeners appear to rely on the acoustic input less and less as more of a word is
heard.

Marslen-Wilson and Welsh (1978) found that when subjects were asked to
shadow prose in which errors occurred at various locations in words, the subjects
tended to restore (i.e., correct) the error more often when the error occurred in
the third syllable of a word (53%) than in the first syllable (45%). Cole, Jakimik,
& Cooper (1978) have reported similar findings. On the other hand, if the task is
changed to error detection, as in a study by Cole and Jakimik (1978), and we
measure reaction time, we find that subjects detect errors faster in final
syllables than in initial syllables.

Both sets of results are compatible with the assumption that word
perception involves a narrowing of possible candidates. As the beginning of a
word is heard, there may be many possibilities as to what could follow. Lack of a
lexical bias would lead subjects to repeat what they hear exactly. They would
also be slower in detecting errors, since they would not yet know what word was
intended. As more of the word is heard, the candidates for word recognition are
narrowed. In many cases, a single possibility will emerge before the end of the
word has been presented. This knowledge interacts with the perceptual process
so that less bottom-up information is required to confirm that the expected word
was heard. In some cases, even errors may be missed. At the same time, when
errors are detected, detection latency will be relatively fast. This is because
the listener now knows what the intended word was.
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PREVIOUS MODELS OF SPEECH PERCEPTION 



One can distinguish two general classes of models of speech perception
which have been proposed. On the one hand we find models which claim to have
some psycholinguistic validity, but which are rarely specified in detail. And on the
other hand are machine-based speech understanding systems; these are
necessarily more explicit but do not usually claim to be psychologically valid.

Psycholinguistic models. Most of the psycholinguistic models lack the kind
of detail which would make it possible to test them empirically. It would be
difficult, for example, to develop a computer simulation in order to see how the
models would work given real speech input.

Some of the models do attempt to provide answers to the problems
mentioned in the previous section. Massaro and his colleagues (Massaro & Oden,
1980a, 1980b; Oden & Massaro, 1978; Massaro & Cohen, 1977) have
recognized the significance of interactions between features in speech perception.
They propose that, while acoustic cues are perceived independently from one
another, these cues are integrated and matched against a propositional prototype
for each speech sound. The matching procedure involves the use of fuzzy logic
(Zadeh, 1972). In this way their model expresses the generalization that
features frequently exhibit "trading relations" with one another. The model is one
of the few to be formulated in quantitative terms, and provides a good fit to the
data Massaro and his co-workers have collected. However, while we value the
descriptive contribution of this approach, it fails to provide an adequate
statement of the mechanisms required for perception to occur.
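
The idea of integrating independently perceived cues against a prototype can
be made concrete with a small sketch. The text says only that the matching uses
fuzzy logic; the multiplicative combination of cue supports and the
normalization across candidate sounds used below are assumptions made for
illustration, and the cue names and support values are hypothetical.

```python
from math import prod

def prototype_match(cue_support):
    """Fuzzy conjunction (here: product) of independent cue supports in [0, 1]."""
    return prod(cue_support.values())

def relative_goodness(matches):
    """Normalize prototype matches so they sum to 1 across candidate sounds."""
    total = sum(matches.values())
    return {sound: match / total for sound, match in matches.items()}

def classify(vot_support_b, f0_support_b):
    """Two hypothetical cues to a /b/-/p/ contrast; support for /p/ is the
    fuzzy negation (1 - x) of the support for /b/ (an assumption)."""
    matches = {
        "b": prototype_match({"vot": vot_support_b, "f0": f0_support_b}),
        "p": prototype_match({"vot": 1 - vot_support_b, "f0": 1 - f0_support_b}),
    }
    return relative_goodness(matches)

# A weaker value on one cue can be offset by a stronger value on the other:
# both calls give the same response strengths, which is one way to read a
# "trading relation" between cues.
print(classify(0.7, 0.5))
print(classify(0.5, 0.7))
```

Under these assumptions the two calls yield the same preference for /b/, since
the product does not care which cue carries the stronger evidence.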

Cole and Jakimik (1978, 1980) have also addressed many of the same
concerns which have been identified here. Among other problems, they note the
difficulty of segmentation, the fact that perception is sensitive to the position
within a word, and that context plays an important role in speech perception.
Unfortunately, their observations, while insightful and well-substantiated, have not
yet led to what might be considered a real model of how the speech perceiver
solves these problems.

The approach with which we find ourselves in greatest sympathy is that
taken by Marslen-Wilson (Marslen-Wilson, 1975, 1980; Marslen-Wilson & Tyler,
1975; Marslen-Wilson & Welsh, 1978). Marslen-Wilson has described a model
which is similar in spirit to Morton's (1979) logogen model and which emphasizes
the parallel and interactive nature of speech perception.

In Marslen-Wilson's model, words are represented by active entities which
look much like logogens. Each word element is a type of evidence-gathering
entity; it searches the input for indications that it is present. These elements
differ from logogens in that they are able to respond actively to mismatches in
the signal. Thus, while a large class of word elements might become active at the
beginning of an input, as that input continues many of the words will be
disconfirmed and will remove themselves from the pool of word candidates.
Eventually only a single word will remain. At this point the word is perceived.
Marslen-Wilson's basic approach is attractive because it accounts for many aspects
of speech perception which suggest that processing is carried out in parallel. While
the model is vague or fails to address a number of important issues, it is
attractive enough so that we have used it as the basis for our initial attempt to
build an interactive model of speech perception. We will have more to say about
this model presently.
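
The narrowing of the candidate pool described above can be sketched in a few
lines. This is a deliberately simplified illustration, not Marslen-Wilson's
formulation: candidates are removed all-or-none on the first mismatch, letters
stand in for phonemes, and the five-word lexicon is hypothetical.

```python
# Toy lexicon (hypothetical): letters stand in for phonemes.
LEXICON = {
    "slender": ("s", "l", "e", "n", "d", "e", "r"),
    "slept":   ("s", "l", "e", "p", "t"),
    "slim":    ("s", "l", "i", "m"),
    "send":    ("s", "e", "n", "d"),
    "lend":    ("l", "e", "n", "d"),
}

def cohort_trace(lexicon, phonemes):
    """Return the surviving candidate set after each incoming phoneme."""
    cohort = set(lexicon)
    history = []
    for position, phoneme in enumerate(phonemes):
        # A word stays in the pool only if it matches the input at this position.
        cohort = {
            word for word in cohort
            if position < len(lexicon[word]) and lexicon[word][position] == phoneme
        }
        history.append(sorted(cohort))
    return history

def recognition_point(lexicon, phonemes):
    """First position (1-based) at which exactly one candidate remains."""
    for position, cohort in enumerate(cohort_trace(lexicon, phonemes), start=1):
        if len(cohort) == 1:
            return position, cohort[0]
    return None

for step, cohort in enumerate(cohort_trace(LEXICON, list("slender")), start=1):
    print(step, cohort)
print(recognition_point(LEXICON, list("slender")))
```

With this toy lexicon the pool shrinks from four candidates to one by the
fourth segment, before the word has ended. Note that a word such as lend can
never enter the pool here, which is one consequence of depending on accurate
word beginnings.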

A number of other speech perception models have been proposed, including
those of Pisoni & Sawusch (1975), Cooper (1979), Liberman, Cooper, Harris, &
MacNeilage (1962), and Halle & Stevens (1964), and many of these proposals
provide partial solutions to the problem. For instance, while there are serious
difficulties with a strong formulation of the Motor Theory of Speech Perception
(Liberman et al., 1962), this theory has focused attention on an important fact.
Many of the phenomena which are observed in an acoustic analysis of speech
appear to be puzzling or arbitrary until one understands their articulatory
foundation. There is good reason to believe that speech perception involves, if not
necessarily (MacNeilage, Rootes, & Chase, 1967) at least preferably, implicit
knowledge of the mapping between articulation and sound. It may well be, as
some have suggested (Studdert-Kennedy, 1982), that speech perception is best
understood as event perception, that event being speech production.

Despite insights such as these, we feel that previous models of speech
perception have serious deficiencies.

First, these models are almost never formulated with sufficient detail that
one can make testable predictions from them. Second, many of them simply fail to
address certain critical problems. For example, few models provide any account
for how the units of speech (be they phonemes, morphemes, or words) are
identified given input in which unit boundaries are almost never present. Nor do
most models explain how listeners are able to unravel the encoding caused by
coarticulation.

While we find the greatest agreement with Marslen-Wilson's approach, there
are a number of significant questions his model leaves unanswered. (1) How do
the word elements know when they match the input? The failure of many
machine-based speech recognition systems indicates this is a far from trivial
problem. (2) Do word elements have internal structure? Do they encode phonemes
and morphemes? (3) How is serial order (of words, phonemes, morphemes, etc.)
represented? (4) How do we recognize nonwords? Must we posit a separate
mechanism, or is there some way in which the same mechanism can be used to
perceive both words and nonwords? (5) How is multi-word input perceived? What
happens when the input may be parsed in several ways, either as one long word
or several smaller words (e.g., sell ya light vs. cellulite)? These are all important
questions which are not addressed.

Machine-based models. It might seem unfair to evaluate machine-based
speech recognition systems as models of speech perception, since most of them
do not purport to be such. But as Norman (1980) has remarked in this context,
"nothing succeeds like success." The perceived success of several of the
speech understanding systems to grow out of the ARPA Speech Understanding
Research project (see Klatt, 1977, for review) has had a profound influence on
the field of human speech perception. As a result, several recent models have
been proposed (e.g., Klatt, 1980; Newell, 1980) which do claim to model human
speech perception, and whose use of pre-compiled knowledge and table look-up
is explicitly justified by the success of the machine-based models. For these
reasons, we think the machine-based systems must be considered seriously as
models of human speech perception.

The two best known attempts at machine recognition of speech are HEARSAY
and HARPY.

HEARSAY (Erman & Lesser, 1980; Carnegie-Mellon, 1977) was the more
explicitly psychologically-oriented of the two systems. HEARSAY proposed
several computationally distinct knowledge sources, each of which could operate
on the same structured data base representing hypotheses about the contents of
a temporal window of speech. Each knowledge source was supposed to work in
parallel with the others, taking information from a central "blackboard" as it
became available, suggesting new hypotheses, and revising the strengths of
others suggested by other processing levels.

Although conceptually attractive, HEARSAY was not a computationally
successful model (in the sense of satisfying the ARPA SUR project goals, Klatt,
1977), and there are probably a number of reasons for this. One central reason
appeared to be the sheer amount of knowledge that had to be brought to bear in
comprehension of utterances, even of utterances taken from a very highly
constrained domain such as the specification of chess moves. Knowledge about
what acoustic properties signaled which phonemes, which phonemes might occur
together and how those co-occurrences condition the acoustic properties,
knowledge of which sequences of speech sounds made legal words in the
restricted language of the system, knowledge about syntactic and semantic
constraints, and knowledge about what it made sense to say in a particular context
had to be accessible. The machinery available to HEARSAY (and by machinery we
mean the entire computational approach, not simply the hardware available) was
simply not sufficient to bring all of these considerations to bear in the
comprehension process in anything close to real time.

Three other problems may have been the fact that the analysis of the
acoustic input rarely resulted in unambiguous identification of phonemes; the
difficulties in choosing which hypotheses would most profitably be pursued
first (the "focus of attention" problem); and the fact that the program was
committed to the notion that the speech input had to be segmented into separate
phonemes for identification. This was a very errorful process. We will argue that
this step may be unnecessary in a sufficiently parallel mechanism.

The difficulties faced by the HEARSAY project with the massive parallel
computation that was required for successful speech processing were avoided by
the HARPY system (Lowerre & Reddy, 1980; Carnegie-Mellon, 1977). HARPY's
main advantage over HEARSAY was that the various constraints used by HEARSAY
in the process of interpreting an utterance were pre-compiled into HARPY's
computational structure, which was an integrated network. This meant that the
extreme slowness of HEARSAY's processing could be overcome; but at the
expense, it turned out, of an extremely long compilation time (over 12 hours of
time on a DEC-10 computer). This trick of compiling in the knowledge, together
with HARPY's incorporation of a more sophisticated acoustic analysis and an
efficient graph-searching technique for pruning the network ("beam search"), made
it possible for this system to achieve the engineering goals established for it.
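
For readers unfamiliar with the term, the general idea of beam search over a
pre-compiled network can be sketched as follows. This is a generic fixed-width
variant written for illustration; it is not HARPY's implementation, and the
network, the state names, and the per-frame scores are all hypothetical.

```python
import heapq

def beam_search(transitions, start, frame_scores, beam_width):
    """Keep only the `beam_width` best partial paths after each input frame.

    transitions:  dict mapping a state to the states reachable from it.
    frame_scores: list of dicts, one per frame, mapping state -> log-score
                  (how well that state matches the frame).
    Returns the surviving (total_score, path) pairs, best first.
    """
    beam = [(0.0, [start])]
    for scores in frame_scores:
        candidates = []
        for total, path in beam:
            for nxt in transitions.get(path[-1], ()):
                if nxt in scores:
                    candidates.append((total + scores[nxt], path + [nxt]))
        # Prune: retain only the highest-scoring partial paths.
        beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return beam

# Hypothetical three-frame example.
transitions = {"<s>": ["s", "f"], "s": ["l", "e"], "f": ["l"], "l": ["i", "e"]}
frame_scores = [
    {"s": -0.2, "f": -1.5},
    {"l": -0.3, "e": -2.0},
    {"e": -0.4, "i": -1.0},
]
for score, path in beam_search(transitions, "<s>", frame_scores, beam_width=2):
    print(round(score, 2), path)
```

The pruning step is what makes the search tractable, and it is also where the
approach commits itself: a path discarded early cannot be recovered later.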

However, HARPY leaves us at a dead end. Its knowledge is frozen into its
structure and there is no natural way for knowledge to be added or modified. It is
extremely unlikely that the simplified transition network formalism underlying
HARPY can actually provide an adequate formal representation of the structure of
language or the flexibility of its potential use in real contexts.



Both the psycholinguistic and the machine models share certain fundamental
assumptions about how the processing of speech is best carried out. These
assumptions derive, we feel, from the belief that the von Neumann digital
computer is the appropriate metaphor for information processing in the brain. This
metaphor suggests that processing is carried out as a series of operations, one
operation at a time; that these operations occur at high speeds; and that
knowledge is stored in random locations (as in Random Access Memory) and must
be retrieved through some search procedure. These properties give rise to a
characteristic processing strategy consisting of iterated hypothesize-and-test
loops. (It is curious that even in the case of HEARSAY, which came closest to
escaping the von Neumann architecture, the designers were unwilling to abandon
this fundamental strategy.)

Yet we note again how poorly this metaphor has served in developing a model
for human speech perception. Let us now consider an alternative.

THE INTERACTIVE ACTIVATION MODEL OF SPEECH PERCEPTION



The Philosophy Underlying the Present Model

In contrast to HARPY and HEARSAY, we do not believe that it is reasonable
to work toward a computational system which can actually process speech in real
time or anything close to it. The necessary parallel computational hardware
simply does not exist for this task. Rather, we believe that it will be more profitable
to work on the development of parallel computational mechanisms which seem in
principle to be capable of the actual task of speech perception, given sufficient
elaboration in the right kind of hardware, and to explore them by running
necessarily slow simulations of massively parallel systems on the available
computational tools. Once we understand these computational mechanisms, they can be
embodied in dedicated hardware specially designed and implemented through very
large scale integration (VLSI).

Again in contrast to HARPY and HEARSAY, we wish to develop a model which
is consistent with what we know about the psychology and physiology of speech
perception. Of course this is sensible from the point of view of theoretical
psychology. We believe it is also sensible from the point of view of designing an
adequate computational mechanism. The only existing computational mechanism that
can perceive speech is the human nervous system. Whatever we know about the
human nervous system, both at the physiological and psychological levels,
provides us with useful clues to the structure and the types of operations of one
computational mechanism which is successful at speech perception.

We have already reviewed the psychological constraints, in considering
reasons why the problem of speech perception is difficult and in exploring possible
clues about how it occurs. In addition, there are a few things to be said about
the physiological constraints.

What is known about the physiology is very little indeed, but we do know the
following. The lowest level of analysis of the auditory signal is apparently a
coding of the frequency spectrum present in the input. There is also evidence of
some single-unit detectors in lower-order mammals for transitions in frequency
either upward or downward, and some single units respond to frequency
transitions away from a particular target frequency (Whitfield & Evans, 1965).
Whether such single units actually correspond to functional detectors for these
properties is of course highly debatable, but the sparse evidence is at least
consistent with the notion that there are detectors for properties of the acoustic
signal beginning at the lowest level with detectors for the particular frequencies
present in the signal. Detectors may well be distributed over large populations of
actual neurons, of course.

More fundamentally, we know that the brain is a highly interconnected
system. The number of neurons in the cortex (conservatively, 10 billion) is not nearly
as impressive as the number of synapses, perhaps as many as 10 . The
connectivity of cortical cells is such that a change of state in one area is likely to
influence neurons over a very wide region.

We know also that neuronal conductivity is relatively slow, compared with
digital computers. Instruction cycle times of digital computers are measured on
the order of nanoseconds; neuronal transmission times are measured on the order
of milliseconds. Where does the power of the human brain come from, then? We
suggest it derives from at least these two factors: the interconnectedness of
the system, and the ability to access memories by content. Content addressable
memory means that information can be accessed directly instead of accessed
through a sequential scan of randomly ordered items.

This leads us toward a model which is explicitly designed to deal with all of
the constraints outlined above. We have adopted the following "design
principles:"

• The model should be capable of producing behavior which is as
similar as possible to human speech perception. We consider
experimental data to be very important in providing constraints and clues
as to the model's design. The model should not only perform as well
as humans, but as poorly in those areas where humans fail.

• The model should be constructed using structures and processes
which are plausible given what we know about the human nervous
system. We do not claim that the model is an image of those
neuronal systems which are actually used in humans to perceive
speech, since we know next to nothing about these mechanisms.
But we have found that mechanisms which are inspired by the
structure of the nervous system offer considerable promise for
providing the kind of parallel information processing which seems to be
necessary.

• The model should not be constrained by the requirement that
computer simulations run in real time. Parallel processes can be
simulated on a serial digital machine, but not at anything approaching
real-time rates. The goal of real time operation at this point would
be counter-productive and would lead to undesirable compromises.



The COHORT Model 

Our initial attempt to construct a model which met these requirements was
called the COHORT model, and it was an attempt to implement the model of that
name proposed by Marslen-Wilson and Welsh (1978). Of course, in implementing
the model many details had to be worked out which were not specified in the
original, so the originators of the basic concept cannot be held responsible for all
of the model's shortcomings. COHORT was designed to perceive word input, with
the input specified in terms of time- and strength-varying distinctive features. It
is based on a lexicon of the 3846 most common words (occurring 10 or more
times per million) from the Kucera & Francis corpus (Kucera & Francis, 1967).

Each of the features, phonemes, and words is represented by a node. Nodes
have roughly the same computational power as is traditionally ascribed to a
neuron. Each node has...

...an associated level of activation which varies over time. These levels may
range from some minimum value usually near -.2 or -.3 to a maximum,
usually set at +1.0;

...a threshold (equal to 0); when a node's activation level exceeds this
threshold it enters what is called the active state and begins to
signal its activation value to other units;

...its own (sub-threshold) resting level of activation to which it returns in
the absence of any external inputs.

Each node may be linked to other nodes in a non-random manner. These
connections may be either excitatory or inhibitory. When a node becomes active, it
excites those nodes to which it has excitatory connections, and inhibits nodes to
which it has inhibitory connections, by an amount proportional to how strongly its
activation exceeds threshold. These connections have associated weightings,
such that some inputs may have relatively greater impact on a node than others.

A node's current activation level reflects several factors: (1) the node's
initial resting level; (2) the spatial and temporal summation of previous inputs
(excitatory and inhibitory); and (3) the node's rate of decay.
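
The node properties listed above can be collected into a small sketch. The
text does not give the update equation, so the rule below (net input scaled by
the distance to the ceiling or floor, minus a decay term pulling toward the
resting level) is an assumption borrowed from the general interactive
activation framework; the resting level, decay rate, and class and function
names are likewise hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    rest: float = -0.1       # sub-threshold resting level (hypothetical value)
    minimum: float = -0.2    # floor on activation
    maximum: float = 1.0     # ceiling on activation
    threshold: float = 0.0
    decay: float = 0.1       # hypothetical rate of decay toward rest
    activation: float = None

    def __post_init__(self):
        if self.activation is None:
            self.activation = self.rest

    def output(self):
        """Signal sent to other nodes: zero until activation exceeds threshold,
        then proportional to the amount by which it exceeds threshold."""
        return max(0.0, self.activation - self.threshold)

    def update(self, net):
        """Advance one time step given the summed, weighted input `net`."""
        if net > 0:
            change = net * (self.maximum - self.activation)
        else:
            change = net * (self.activation - self.minimum)
        change -= self.decay * (self.activation - self.rest)
        self.activation = min(self.maximum, max(self.minimum, self.activation + change))
        return self.activation

def net_input(connections):
    """connections: iterable of (source_node, weight) pairs;
    a positive weight is excitatory, a negative weight inhibitory."""
    return sum(weight * source.output() for source, weight in connections)

# A node driven by steady excitation becomes active, then decays back to rest.
node = Node("example")
node.update(0.2)
print(node.activation, node.output())
for _ in range(200):
    node.update(0.0)
print(round(node.activation, 6))
```

With these assumed values a single step of excitation is enough to push the
node over threshold, and in the absence of input it relaxes to its resting
level, as the text describes.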

A fragment of the system just described is illustrated in Figure 1. At the
lowest level we see the nodes for the acoustic/phonetic features. COHORT
makes use of a set of 22 nodes for 11 bipolar features which are modifications of
the Jakobsonian distinctive features (Jakobson, Fant, & Halle, 1952). These
nodes are activated directly by the input to the model (described below). The
features were chosen for the initial working model for several reasons. They
have proven useful in the description of certain linguistic phenomena (such as
sound change), which suggests they have some psychological reality; the
Jakobsonian features are defined in (sometimes vague) acoustic terms; and recent
work by Blumstein and Stevens (1980; Stevens & Blumstein, 1981) appears to
confirm that some of the features might serve as models for more precise
acoustic templates.

At the next higher level are the nodes for phonemes. COHORT has nodes for
37 different phonemes, including an abstract unit which marks the end of words.
All phonemes except the end of word marker receive excitatory inputs from those
features which signal their presence. Thus, the node for /p/ is activated by input
from the nodes grave, compact, consonantal, oral, voiceless, etc.

Figure 1. Fragment of the COHORT system. Nodes exist for features, phonemes,
and words. The word nodes have a complex schema associated with them, shown
here only for the word bliss. Connections between nodes are indicated by arcs;
excitatory connections terminate in arrows and inhibitory connections terminate in
filled circles.

Before describing the word nodes, a comment is in order regarding the
features and phonemes which were used in COHORT. These choices represent
initial simplifications of very complicated theoretical issues, which we have
chosen not to broach at the outset. Our goal has been to treat the model as a
starting place for examining a number of computational issues which face the
development of adequate models of speech perception, and it is our belief that
many of these issues are independent of the exact nature of the assumptions we
make about the features. The Jakobsonian feature set was a convenient starting
point from this point of view, but it should be clear that the features in later
versions of the model will need substantial revision. The same caveat is true
regarding the phonemes. It is even conceivable that some other type of unit will
ultimately prove better. Again, to some degree, the precise nature of the unit
(phoneme, demisyllable, context-sensitive allophone, transeme, etc.) is
dissociable from the structure in which it is embedded.

It might be argued that other choices of units would simplify the problem of
speech perception considerably and make it unnecessary to invoke the complex
computational mechanisms we will be discussing below. Indeed, some of the units
which have been proposed as alternatives to phonemes have been suggested as
answers to the problem of context-sensitive variation. That is, they encode,
frozen into their definition, variations which are due to context. For example,
context-sensitive allophones (Wickelgren, 1969) attempt to capture differences
in the realizations of particular phonemes in different contexts by imagining that
there is a different unit for each different context. We think this merely
postpones a problem which is pervasive throughout speech perception. In point of
fact, none of these alternatives is able to truly solve the variability which
extends over broad contexts, or which is due to speaker differences, or to
changes in rate of articulation. For this reason we decided to begin with units
(phonemes) which are frankly context-insensitive, and to see if their variability in
the speech stream could be dealt with through the processing structures.

Let us turn now to the word nodes. Words present a special problem for
COHORT. This is because words contain internal structure. In the current version
of the system, this structure is limited to phonemes, but it is quite likely that word
structure also contains information about morphemes and possibly syllable
boundaries. To account for the fact that words are made up of ordered sequences of
phonemes, it seems reasonable to assume that the perceiver's knowledge of
words specifies this sequence.

Word nodes are thus complex structures. A node network which depicts a
word structure is shown for the word bliss in Figure 1. The schema consists of
several nodes, one for each of the phonemes in the word, and one for the word
itself. The former are called token nodes, since there is one for each occurrence
of each phoneme in the word. The latter is simply called the word node. At the
end of each word there is a special token node corresponding to a word boundary.

Token nodes have several types of connections. Token-word connections
permit tokens to excite their word node as they become active (pass threshold).
Word-token links allow the word node to excite its constituent tokens. This
serves both to reinforce tokens which may have already received bottom-up
input, as well as to prime tokens that have not yet been "heard." Phoneme-token
connections provide bottom-up activation for tokens from phonemes. Finally,
token-token connections let active tokens prime successive tokens and keep
previous tokens active after their bottom-up input has disappeared. Because
listeners have some expectation that new input will match word beginnings, the
first token node of each word has a slightly higher resting level than the other
tokens. (In some simulations, we have also set the second token node to an
intermediate level, lower than the first and higher than the remaining tokens.)
Once the first token passes threshold, it excites the next token in the word. This
priming, combined with the order in which the input actually occurs, is what
permits the system to respond differently to the word pot than to top.
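
The way resting levels and token-to-token priming jointly produce order
sensitivity can be illustrated with a heavily simplified sketch. It assumes one
phoneme per time step, lets sub-threshold bottom-up input dissipate between
steps, treats an active token as staying active, and omits word-token feedback;
the function name and all numeric values are hypothetical, and letters stand in
for phonemes.

```python
def active_tokens(word, heard, first_rest=-0.1, other_rest=-0.4,
                  bottom_up=0.3, priming=0.3, threshold=0.0):
    """Count how many tokens of `word` pass threshold while `heard` is presented."""
    # The first token rests closer to threshold than the others.
    rest = [first_rest] + [other_rest] * (len(word) - 1)
    primed = [0.0] * len(word)
    active = [False] * len(word)
    for phoneme in heard:
        for i, token in enumerate(word):
            if active[i] or token != phoneme:
                continue
            # Bottom-up input alone lifts only the first token over threshold;
            # later tokens also need priming from their predecessor.
            if rest[i] + primed[i] + bottom_up > threshold:
                active[i] = True
                if i + 1 < len(word):
                    primed[i + 1] = priming
    return sum(active)

print(active_tokens("pot", "pot"))
print(active_tokens("top", "pot"))
```

Under these assumptions the schema for pot has all three tokens active after
the input pot, whereas the schema for top has only its initial /t/ active,
so the two word nodes receive very different amounts of support from the same
set of phonemes.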

In addition to internal connections with their token nodes, word nodes have
inhibitory connections with all other word nodes. This inhibition reflects
competition between word candidates. Words which match the input will compete with
other words which do not, and will drive their activation levels down.
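
A minimal sketch of this competition follows. Each word node receives a fixed
amount of bottom-up support and is inhibited in proportion to the summed output
of its rivals. The update rule is the same assumed form used for single nodes
in the interactive activation framework, and the support values, inhibition
weight, and function name are hypothetical rather than taken from COHORT.

```python
def compete(bottom_up, inhibition=0.3, decay=0.1, steps=50,
            rest=-0.1, minimum=-0.2, maximum=1.0):
    """Run mutual inhibition among word nodes and return final activations.

    bottom_up: dict mapping each word to its (constant) excitatory support.
    """
    act = {word: rest for word in bottom_up}
    for _ in range(steps):
        # Only activation above the threshold of 0 is signalled to rivals.
        out = {word: max(0.0, a) for word, a in act.items()}
        new = {}
        for word, a in act.items():
            rivals = sum(o for other, o in out.items() if other != word)
            net = bottom_up[word] - inhibition * rivals
            if net > 0:
                change = net * (maximum - a)
            else:
                change = net * (a - minimum)
            change -= decay * (a - rest)
            new[word] = min(maximum, max(minimum, a + change))
        act = new
    return act

final = compete({"slender": 0.3, "slim": 0.1, "send": 0.0})
for word, activation in final.items():
    print(word, round(activation, 3))
```

With these values the best-supported word settles well above threshold, while
the partially supported rival is briefly active and is then pushed below its
resting level along with the unsupported word.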



Word recognition in COHORT 

To further illustrate how COHORT works, we will describe what is involved in 
recognizing the word slender. 

COHORT does not currently have the capability for extracting features from
real speech, so we must provide it with a hand-constructed approximation of
those features which would be present in the word slender. Also, since the model
is simulated on a digital computer, time is represented as a series of discrete
samples. During each sampling period COHORT receives a list of those features
which might be present during that portion of the word. These features have
time-varying strengths. To simulate one aspect of coarticulation, the features
overlap and rise and fall in strength.
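
One way such hand-constructed input might be laid out is sketched below. The
triangular rise-and-fall envelope, the segment length, the amount of overlap,
and the feature lists assigned to each segment are all assumptions made for
illustration; the text does not specify how the input was built.

```python
def feature_input(segments, span=10, overlap=4):
    """Build one dict per time sample, mapping feature name -> strength.

    segments: list of feature-name lists, one per phoneme, in temporal order.
    Each segment's features rise to a peak near its midpoint and fall again;
    successive segments begin `span - overlap` samples apart, so neighbouring
    segments are partly simultaneous.
    """
    step = span - overlap
    total = step * (len(segments) - 1) + span
    samples = [dict() for _ in range(total)]
    for index, features in enumerate(segments):
        onset = index * step
        for t in range(span):
            # Triangular envelope across the segment.
            strength = 1.0 - abs((t + 0.5) - span / 2) / (span / 2)
            for feature in features:
                current = samples[onset + t].get(feature, 0.0)
                samples[onset + t][feature] = max(current, strength)
    return samples

# Hypothetical feature lists for the first two segments of "slender".
samples = feature_input([
    ["consonantal", "strident", "voiceless"],   # rough stand-in for /s/
    ["consonantal", "vocalic", "voiced"],       # rough stand-in for /l/
])
for t, sample in enumerate(samples):
    print(t, {name: round(value, 1) for name, value in sorted(sample.items())})
```

In the overlapping region a single sample carries features of both segments at
once, which is the property the simulation relies on to mimic coarticulation.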

At the beginning of the simulation, all nodes in the system are at their resting
levels. During the first few sampling periods the feature nodes receive activation
from the input, but their activation levels remain below threshold. Eventually,
however, some feature nodes become active and begin to excite all the phonemes
which contain them. In the present example, activation of the features for /s/
results in excitation of the phonemes /z/, /f/, and /v/ as well as /s/. This is
because these other phonemes closely resemble /s/ and contain many of the
same features. The /s/, however, is most strongly activated.
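
The graded excitation of similar phonemes follows directly from shared
features, as the sketch below illustrates. The feature assignments are a
simplified, hypothetical subset chosen only to make the point (they are not
COHORT's actual 11-feature specification), and the connection weight is
arbitrary.

```python
# Simplified, hypothetical feature sets for four fricatives.
PHONEME_FEATURES = {
    "s": {"consonantal", "continuant", "strident", "acute", "voiceless"},
    "z": {"consonantal", "continuant", "strident", "acute", "voiced"},
    "f": {"consonantal", "continuant", "strident", "grave", "voiceless"},
    "v": {"consonantal", "continuant", "strident", "grave", "voiced"},
}

def phoneme_excitation(active_features, weight=0.1):
    """Sum the excitatory input each phoneme node receives from active features.

    active_features: dict mapping feature name -> output strength of that node.
    """
    return {
        phoneme: weight * sum(active_features.get(f, 0.0) for f in features)
        for phoneme, features in PHONEME_FEATURES.items()
    }

# All of the features of /s/ are fully active in the input.
heard = {feature: 1.0 for feature in PHONEME_FEATURES["s"]}
for phoneme, excitation in phoneme_excitation(heard).items():
    print(phoneme, round(excitation, 2))
```

With this feature table /s/ receives the most input, /z/ and /f/ (one feature
away) somewhat less, and /v/ (two features away) less still.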

The next thing that happens is that active phonemes excite their
corresponding token nodes in all the words that contain those phonemes. Initial
token nodes (such as the /s/ in slender) are more likely to pass threshold than
word-internal nodes (such as the /s/ in twist) since these nodes have higher
resting levels. When the token nodes become active, they begin to activate word
nodes and also their successor token nodes. Of course, while all this happens,
input continues to provide bottom-up excitation.

As time goes on, the internal connections begin to play an increasing role in
determining the state of the system. Once word nodes become active they
provide a strong source of top-down excitation for their token nodes and also
compete with one another via inhibitory connections. Early in the input there may be
many words which match the input and are activated. These will compete with one
another but none will be able to dominate; however, they will drive down the
activations of other words. Those words which fail to continue to receive
bottom-up excitation will fall away, both through their own decay and through
inhibition from more successful candidates. Eventually only a single word will
remain active and will push down the activation levels of unsuccessful word
nodes.

One can monitor these events by examining the activation levels of the
various types of nodes in the system. In Figure 2, for example, we see a graph of the
activation levels of word nodes, given input appropriate to the word slender. At
time t0 the word nodes' activation levels rest just below threshold. During the
first 15 or so time cycles the activation levels remain constant, since it takes a
while for the feature, phoneme, and token nodes to become active and excite
the word nodes. After this happens a large number of words become active.
These are all the words which begin with the phoneme /s/. Shortly after the
25th cycle features for the phoneme /l/ are detected and words such as send
fall away, but other words such as slim remain active. When the /e/ is detected
slim and similar words are inhibited. At the end only slender remains active.

This simulation reveals two interesting properties of COHORT. First, we note
that occasionally new words such as lend and endless join the cohort of active
words. Even though they do not begin with /s/ they resemble the input enough to
reach threshold. We regard this as desirable because it is clear that human
listeners are able to recover from initial errors. One problem we have found in
other simulations is that COHORT does not display this behavior consistently
enough.

Secondly, we see that the word node for slender begins to dominate
surprisingly early in time. In fact, it begins to dominate at just the point where it
provides a unique match to the input. This agrees with Marslen-Wilson's (1980)
claim that words are recognized at the point where they become uniquely
identifiable.

We can also monitor the activation levels of the tokens within the word
schema for slender, as shown in Figure 3. At time t0 all tokens are below
threshold, although /s/ is near threshold and the /l/ is also slightly higher than the
remaining tokens. (Recall that the initial tokens have higher resting levels,
reflecting perceivers' expectations for hearing first sounds first.) The /s/ token
passes threshold fairly quickly. When it becomes active it excites both the
slender word node and also the next token in the word, /l/. After more cycles, the
/l/ token begins to receive bottom-up input from the feature nodes, and the
/s/'s feature input decreases.

Figure 2. Activation levels of selected word nodes, given feature inputs
appropriate for the word slender. At the start all words which begin with s are
activated. As time goes on only those words which more closely resemble the input
remain active; other words decay and are also inhibited by the active nodes. Finally
only the node for slender dominates.

Figure 3. Activations of the token nodes associated with slender, given input
appropriate for this word.

The same basic pattern continues throughout the rest of the word, with some
differences. The level of nodes rises slowly even before they receive bottom-up
input and become active. This occurs because the nodes are receiving lateral
priming from earlier tokens in the word, and because once the word node becomes
active it primes all its constituent token nodes. This lateral and top-down
excitation is also responsible for the tendency of token nodes to increase again after
decaying once bottom-up input has ceased (for example, /s/'s level starts to
decay at cycle 25, then begins to increase at cycle 30). By the end of the word,
all the tokens are very active, despite the absence of any bottom-up excitation.

This example demonstrates how COHORT deals with two of the problems we
noted in the first section. One of these problems, it will be recalled, is the
spreading of features which occurs as a result of coarticulation. At any
single moment in time, the signal may contain features not only of the
"current" phoneme but also of neighboring phonemes. In the current version of
COHORT we provide the simulation with hand-constructed input in which this
feature spreading is artificially mimicked. Because COHORT is able to activate
many features and phonemes at the same time, this coarticulation helps the
model anticipate phonemes which may not, properly speaking, be fully present.
In this way coarticulation is treated as an aid to perception, rather than as
a source of noise. While the sort of artificial input we provide obviously
does not present the same level of difficulty as real speech, we believe that
COHORT's approach to dealing with these rudimentary aspects of coarticulation
is on the right track.

A second problem faced by many speech recognition systems is that of
segmentation: How do you locate units in a signal which contains few obvious
unit boundaries? For COHORT this problem simply never arises. As the evidence
for different phonemes waxes and wanes, the activation levels of phonemes and
tokens rise and fall in continuous fashion. Tokens which are activated in the
right sequence (i.e., belong to real words) activate word nodes, which are
then able to provide an additional source of excitation for the tokens. At the
end of the process, all the phoneme tokens of the word that has been heard are
active, but there is no stage during which explicit segmentation occurs.

In addition to these two characteristics, COHORT can be made to simulate 
two phenomena which have been observed experimentally in human speech per- 
ception. The first of these phenomena is phonemic restoration. 

The human speech processing system is capable of perceiving speech in the
face of considerable noise. This ability was studied in an experiment by
Warren (1970). Warren asked subjects to listen to tapes containing naturally
produced words in which portions of the words had been replaced by noise.
Warren found that, although subjects were aware of the presence of noise, they
were unaware that any part of the original word had been deleted. (In fact,
they were usually unable to say where in the word the noise occurred.) Samuel
(in press) has replicated and extended these findings using a signal detection
paradigm. (In Samuel's experiments, some stimuli have phonemes replaced by
noise and other stimuli have noise added in. The subjects' task is to
determine whether the phoneme is present or absent.) One of Samuel's important
findings is that this phenomenon, phonemic restoration, actually completes the
percept so strongly that it makes subjects insensitive to the distinction
between the replacement of a phoneme by noise and the mere addition of noise
to an intact speech signal. Listeners actually perceive the missing phonemes
as if they were present.

We were interested in seeing how COHORT would respond to stimuli in which
phonemes were missing. To do this, we prepared input protocols in which we
turned off feature input during those cycles which corresponded in time to a
particular phoneme. In one of these simulations, we deleted all feature input
for the /d/ of slender. (Note that this differs slightly from the standard
phonemic restoration experiment, in which noise is added to the signal after a
phoneme is deleted.)

In Figure 4 we observe the activations of the slender token nodes which
result from this input. These levels may be compared with those in Figure 3.
There are no obvious differences between the two conditions. The /d/ token
succeeds in becoming active despite the absence of bottom-up input. This
suggests that the token-token priming and the top-down excitation from word to
token are a powerful force during perception.
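
The restoration result can be illustrated with a deliberately small sketch. It
is not the COHORT simulation itself: the update rule, the parameter values,
and the names `simulate_word`, `bottom_up`, `lateral`, `top_down`, and
`word_gain` are all assumptions chosen for illustration, and inhibition
between competing words is left out. Each token is driven by feature input
during its own time window, by lateral priming from the preceding token, and
by feedback from a single word node.

```python
def simulate_word(n_tokens=7, deleted=(), feedback=True, steps_per_token=10,
                  bottom_up=0.3, lateral=0.1, top_down=0.1,
                  word_gain=0.2, decay=0.05):
    """Toy token chain for one word (e.g. the 7 tokens of "slender").

    All parameters are hypothetical. Activations stay in [0, 1] and the
    resting level is 0. Returns (final token activations, word activation).
    """
    tokens = [0.0] * n_tokens
    word = 0.0
    for t in range(n_tokens * steps_per_token):
        current = t // steps_per_token      # token whose features are in the input
        updated = []
        for i, act in enumerate(tokens):
            net = 0.0
            if i == current and i not in deleted:
                net += bottom_up            # feature input, unless it was deleted
            if feedback:
                if i > 0:
                    net += lateral * tokens[i - 1]   # priming from preceding token
                net += top_down * word               # word-to-token feedback
            # excitation is scaled by the distance from ceiling; decay pulls to rest
            updated.append(act + net * (1.0 - act) - decay * act)
        mean_token = sum(tokens) / n_tokens
        word = word + word_gain * mean_token * (1.0 - word) - decay * word
        tokens = updated
    return tokens, word


D = 4  # index of /d/ in s-l-e-n-d-e-r
intact, _ = simulate_word()
restored, _ = simulate_word(deleted={D})
no_feedback, _ = simulate_word(deleted={D}, feedback=False)
print(round(intact[D], 3), round(restored[D], 3), no_feedback[D])
```

In this sketch the deleted token still rises above its resting level when
lateral and top-down excitation are present, and stays at exactly zero when
they are switched off.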

Figure 5 compares the word node activation for slender with and without /d/ 
input. The two patterns are remarkably alike. COHORT appears to respond much 
as human perceivers do given similar input — the distinction between the pres- 
ence and the absence of the /d/ is lost in context. 

A second phenomenon we attempted to replicate with COHORT was the lexical
bias in phoneme identification first noted by Ganong (1980). As previously
mentioned, Ganong discovered that if listeners are asked to identify the
initial consonant in stimuli which range perceptually from a word to a
nonword, the phoneme boundary is displaced toward the word end of the
continuum, compared with its location on a non-word/word continuum. In short,
lexical status affects perception at the level of phonetic categorization.

In order to simulate this experiment, we presented COHORT with input which
corresponded to a word-initial bilabial stop, followed by features for the
sequence _ar. The feature values for the bilabial stop were adjusted in such a
way as to make it indeterminate for voicing; it sounded midway between bar and
par. Although COHORT knows the word bar, it does not have par in its lexicon,
so par is effectively a nonword for the purposes of the simulation.

Figure 4. Activations of the token nodes associated with slender, given input
appropriate for this word.

Figure 5. Activation levels of the slender word node for input in which the d
is present (solid line), compared to when the d is absent (broken line).

The simulation differed from Ganong's experiment in that he measured the
phoneme boundary shift by presenting a series of stimuli to subjects and then
calculating the boundary as the location of the 50% labelling crossover. In
our experiment we were able to present the model with a stimulus which should
have been exactly at the phoneme boundary, assuming a neutral context (e.g.,
if the stimulus had been a nonsense syllable such as ba or pa rather than a
potential word). The way we determined whether or not a lexical effect similar
to Ganong's had occurred was to examine the activation levels of the /b/ and
/p/ phoneme nodes.

Figure 6 shows the activation levels of these two nodes over the time
course of processing the input stimulus. Both nodes become highly activated
during the first part of the word. This is the time when bottom-up input is
providing equal activation for both voiced and voiceless bilabial stops. Once
the bottom-up input is gone, both levels decay. What is of interest is that
the /b/ node remains at a higher level of activation. We assume that this
higher level would be reflected in a boundary shift on a phoneme
identification test toward the voiced end of the continuum.

When we think about why COHORT displays this behavior, which is similar to
that of Ganong's human subjects, we realize that the factors responsible for
the greater activation of the /b/ node are essentially the same as those which
cause phonemic restoration. Top-down excitation from the word level exerts a
strong influence on perception at the phoneme level.

This realization leads to an interesting prediction. Because the lexical
effect reflects the contribution of top-down information, it should be the
case that when the target phoneme (i.e., the one to be identified) occurs
later in the word, rather than at the beginning as is the case with the
bar/par stimulus, the difference in activations of the two competing nodes
should be magnified. This is because the word node has had longer to build up
its own activation and is therefore able to provide greater support for the
phoneme which is consistent with it.

Figure 7 demonstrates that COHORT does indeed perform in this manner. We
presented the simulation with input appropriate to the sequence ra_ followed
by a bilabial stop that was again intermediate with regard to voicing. The
word rob is in COHORT's lexicon, but rop is not, so we would expect a greater
level of activation for /b/ than for /p/, based on top-down excitation.

This indeed occurs. But what we also find is that the magnitude of the
difference is slightly greater than when the target phoneme occurs at the
beginning of the word. The finding has not yet been tested with human
perceivers, but it is consistent with other findings mentioned above (Cole &
Jakimik, 1978, 1980; Marslen-Wilson & Welsh, 1978) which point to greater
top-down effects at word endings than at word beginnings.
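
The logic behind the position prediction can be sketched with a toy
calculation. This is an assumption-laden illustration rather than the COHORT
code: `step`, `ambiguous_stop`, and every parameter value are hypothetical,
the context phonemes are reduced to a constant input to one word node, and the
two stop nodes are compared at the moment their shared bottom-up input ends.

```python
def step(act, net, decay=0.05):
    """One update: excitation scaled by distance from ceiling, minus decay."""
    return act + net * (1.0 - act) - decay * act


def ambiguous_stop(n_context_before, steps=10, bottom_up=0.15,
                   top_down=0.1, word_gain=0.2):
    """Return (/b/, /p/) activations at the offset of an ambiguous stop.

    Only the /b/ reading belongs to a word, so only /b/ receives feedback.
    n_context_before counts the unambiguous phonemes heard before the stop.
    """
    b = p = word = 0.0
    # Context phonemes drive the word node, which in turn primes /b/.
    for _ in range(n_context_before * steps):
        b, word = step(b, top_down * word), step(word, word_gain)
    # The ambiguous stop excites /b/ and /p/ equally from the bottom up.
    for _ in range(steps):
        b, p, word = (step(b, bottom_up + top_down * word),
                      step(p, bottom_up),
                      step(word, word_gain * b))
    return b, p


b_initial, p_initial = ambiguous_stop(0)   # target first, as in bar/par
b_final, p_final = ambiguous_stop(2)       # target last, as in rob/rop
print(b_initial - p_initial, b_final - p_final)
```

Because the word node has had two phonemes' worth of time to build up before
the final stop arrives, the /b/ advantage in this sketch is larger in the
word-final case than in the word-initial case, while /p/ is identical in both.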
Figure 6. Activation of b and p phoneme nodes, given feature input for the
sequence bilabial stop+a+r, in which the stop is indeterminate for voicing.
Since the lexicon contains the word bar but not par, top-down excitation
favors the perception of the stop as voiced.

Figure 7. Activation of b and p phoneme nodes, given feature input for the
sequence r+a+bilabial stop, in which the stop is indeterminate for voicing.
The lexicon contains the word rob, but not rop, so the b node becomes more
activated than the p node.

In simulating Ganong's lexical effect on the phoneme boundary, we added a
provision to COHORT which was not provided for by Marslen-Wilson and Welsh
(1978): feedback from the word to the phoneme level. They, along with Morton
(197Q), have accounted for lexical and other contextual effects on phoneme
identification in terms of a two-step process, in which context affects word
identification, and then the phonological structure of the word is unpacked to
determine what phonemes it contains.

The alternative we prefer is to permit feedback from the words to actually
influence activations at the phoneme level. In this way, partial activations
of words can influence perception of nonwords.

The addition of feedback from the words to the phoneme level in COHORT
raises a serious problem, however. If the feedback is strong enough so that
the phoneme nodes within a word are kept active as the perceptual process
unfolds, then all words sharing the phonemes which have been presented
continue to receive bottom-up support and the model begins to lose its ability
to distinguish words having the same phonemes in them in different orders.
This and other problems, to be reviewed below, have led us to a different
version of an interactive activation model of speech perception, called TRACE.



The TRACE Model 

Given COHORT's successes, one might be tempted to suggest that it may be
feedback to the phoneme level, and not the rest of the assumptions of COHORT,
which is in error. However, there are other problems as well with this version
of the model. First, words containing multiple occurrences of the same phoneme
present serious problems for the model. The first occurrence of the phoneme
primes all the tokens of this phoneme in words containing this phoneme
anywhere. Then the second occurrence pushes the activations of all of these
tokens into the active range. The result is that words containing the repeated
phoneme anywhere in the word become active. At the same time, all words
containing multiple occurrences of the twice-active phoneme get so strongly
activated that the model's ability to distinguish between them based on
subsequent (or prior) input is diminished. A second difficulty is that the
model is too sensitive to the durations of successive phonemes. When durations
are too short they do not allow for sufficient priming. When they are too long
too much priming occurs and the words begin to "run away" independently of
bottom-up activation.

In essence, both of these problems come down to the fact that COHORT
uses a trick to handle the sequential structure of words: it uses lateral
priming of one token by another to prepare to perceive the second phoneme in a
word after the first and so on. The problems described above arise from the
fact that this is a highly unreliable way of solving the problem of the
sequential structure of speech. To handle this problem there needs to be some
better way of directing the input to the appropriate place in the word.

Sweeping the input across the tokens. One way to handle some of these
problems is to assume that the input is sequentially directed to the
successive tokens of each word. Instead of successive priming of one token by
the next, we could imagine that when a token becomes active, it causes
subsequent input to be gated to its successor rather than itself. All input,
of course, could be directed initially toward the first token of each word. If
this token becomes active, it could cause the input to be redirected toward
the next token. This suggestion has the interesting property that it
automatically avoids double activation of the same token on the second
presentation of the corresponding phoneme. It may still be sensitive to rate
variations, though this could be less of a problem than in the preceding
model. Within-word filling in could still occur via the top-down feedback from
the word node, and of course this would take a while to build up, so it would
be more likely to occur for later phonemes than for earlier ones.
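
The gating idea can be made concrete with a minimal sketch. Everything here is
an assumption for illustration: letters stand in for phonemes, the input is
one phoneme label per cycle, and `gated_sweep`, `threshold`, and `gain` are
hypothetical names and values.

```python
def gated_sweep(word, inputs, threshold=0.5, gain=0.3):
    """Direct each input cycle only to the token the gate currently points at.

    Returns (token activations, final gate position). When a token reaches
    threshold, the gate moves to its successor, so later input can no longer
    reach that token.
    """
    activations = [0.0] * len(word)
    gate = 0
    for phoneme in inputs:
        if gate < len(word) and phoneme == word[gate]:
            activations[gate] += gain * (1.0 - activations[gate])
            if activations[gate] >= threshold:
                gate += 1           # redirect subsequent input to the successor
    return activations, gate


# "dad" contains /d/ twice; each phoneme lasts three cycles.
act, gate = gated_sweep("dad", "dddaaaddd")
# A noisy beginning ("x") leaves the gate waiting at the first token.
act_noisy, gate_noisy = gated_sweep("dad", "xxxaaaddd")
print(act, gate, act_noisy, gate_noisy)
```

In the first call the final /d/ is routed to the third token and the first
token is not activated a second time. In the second call the word is never
completed, because the gate is still waiting at the first token when the later
phonemes arrive.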

However, this scheme shares a serious problem with the previous one. In the
absence of prior context, both versions depend critically on clear word
beginnings to get the right word schemas started. We suspect that it is
inferior to human perceivers in this respect. That is, we suspect that humans
are able to recognize words correctly from their endings (in so far as these
are unique) even when the beginnings are sufficiently noisy so that they would
produce only very weak word-level activations at first and thus would not get
the ball rolling through the word tokens.

Generalized sweeping. A potential solution to this problem would be to
sweep the input through all tokens, not just those in which the input has
already produced activations. However, it is not clear on what basis to
proceed with the sweep. If it were possible to segment the input into phonemes
then one could step along as each successive phoneme came in; but we have
argued that there is no segmentation into phonemes. Another possibility is to
step along to the next token as tokens become active at the current position
in any words. Though this does not require explicit segmentation of the input,
it has its drawbacks as well. For one thing it means that the model is
somewhat rigidly committed to its position within a word. It would be
difficult to handle cases where a nonsense beginning was followed by a real
word (as in, say, unptlcohort), since the model would be directing the ending
toward the ends of longer words rather than toward beginnings.

The memory trace. A problem with all of the schemes considered thus far is
that they have no memory, except within each word token. Patterns of
activation at the phoneme level come and go very quickly; if they do not,
confusion sets in. The fact that the memory is all contained within the
activations of the word tokens makes it hard to account for context effects in
the perception of pseudowords (Samuel, 1970). Even when these stimuli are not
recognized as words, missing phonemes which are predictable on the basis of
regularities in patterns of phoneme co-occurrence are nevertheless filled in.
Such phenomena suggest that there is a way of retaining a sequence of phonemes
(and even filling in missing pieces of it) when that sequence does not form a
word. One possibility is to imagine that the activations at the phoneme level
are read out into some sort of post-identification buffer as they become
active at the phoneme level. While this may account for some of the pseudoword
phenomena, retrospective filling in of missing segments would be difficult to
arrange. What appears to be needed is a dynamic memory in which incomplete
portions of past inputs can be filled in as the information which specifies
them becomes available. The TRACE model attempts to incorporate such dynamic
memory into an interactive activation system. We are only now in the process
of implementing this model via a computer simulation, so we can only offer the
following sketch of how it will work.

We propose that speech perception takes place within a system which
possesses a dynamic representational space which serves much the same function
as the Blackboard in HEARSAY. We might visualize this buffer as a large set of
banks of detectors for phonetic features and phonemes, and imagine that the
input sweeps out a pattern of activation through this buffer. That is, the
input at some initial time t0 would be directed to the first bank of
detectors, the input at the next time slice would be directed to the next
bank, and so on. These banks are dynamic; that is, they contain nodes which
interact with each other, so that processing will continue in them after
bottom-up input has ceased. In addition to the interactions within a time
slice, nodes would interact across slices. Detectors for mutually incompatible
units would be mutually inhibitory, and detectors for the units representing
an item spanning several slices would support each other across slices. We
assume in this model that information written into a bank would tend to decay,
but that the rate of decay would be determined by how strongly the incoming
speech pattern set up mutually supportive patterns of activation within the
trace.



Above the phoneme level, we presume that there would be detectors for
words. These, of course, would span several slices of the buffer. It seems
unreasonable to suppose that there is an existing node network present
containing nodes for each word at each possible starting position in the
buffer. It seems, then, that the model requires the capability of creating
such nodes when it needs them, as the input comes in. Such nodes, once
created, would interact with the phoneme buffers in such a way as to ensure
that only the correct sequence of phonemes will strongly activate them. Thus,
the created node for the word cat starting in some slice will be activated
when there is a /c/ in the starting slice and a few subsequent slices, an /a/
in the next few slices, and a /t/ in the next few, but will not be excited
(except for the /a/) when these phonemes occur in the reverse order.
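
Since the model is described here only in outline, the following is a sketch
of one way the trace and a position-specific word detector might be
represented, not a description of an implementation. Letters stand in for
phonemes, each phoneme is assumed to fill exactly three slices, and
`write_trace` and `word_support` are hypothetical names.

```python
def write_trace(phonemes, slices_per_phoneme=3):
    """Direct each successive stretch of input to the next bank of detectors.

    The trace is a list of banks, one per time slice; each bank maps a
    phoneme label to its activation.
    """
    trace = []
    for phoneme in phonemes:
        for _ in range(slices_per_phoneme):
            trace.append({phoneme: 1.0})
    return trace


def word_support(trace, word, start, slices_per_phoneme=3):
    """Bottom-up support for a word detector created at a given start slice.

    Each slice contributes only if it holds the phoneme the word expects at
    that position, so the same phonemes in another order give little support.
    """
    span = len(word) * slices_per_phoneme
    total = 0.0
    for offset in range(span):
        index = start + offset
        if index >= len(trace):
            break                   # slices not yet written contribute nothing
        expected = word[offset // slices_per_phoneme]
        total += trace[index].get(expected, 0.0)
    return total / span


print(word_support(write_trace("cat"), "cat", start=0))
print(word_support(write_trace("tac"), "cat", start=0))
print(word_support(write_trace("xcat"), "cat", start=3))
```

With the reversed input only the /a/ slices line up with the detector, which
matches the exception noted in the text. Because a detector can be created at
any starting slice, a word preceded by an unrelated segment is still fully
supported.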

A simplified picture of the TRACE model is shown in Figure 8. Time is
represented along the horizontal axis, with successive columns for individual
memory traces. Within each trace there are nodes for features and phonemes,
but only phoneme nodes are shown here. The activation level of each of these
nodes (and of the word nodes above) is shown as a horizontal bar; thicker bars
indicate greater levels of activation. Along the bottom is shown sample input.
The input is presented here in phonemic form for ease of representation; it
would actually consist of the excitations from the (missing) feature nodes,
which in turn would be excited by the speech input.

Figure 8. Partial view of the TRACE system. Time is represented along the
horizontal axis, with columns for succeeding "traces." Each trace contains
phoneme and feature nodes (only the phoneme nodes are shown). Input is shown
along the bottom in phonemic form; in reality, input to the phoneme nodes
would consist of excitation from the feature nodes within each trace. At the
top are shown the word nodes and the activations they receive in each time
slice. Because the input can be parsed in various ways, several word nodes are
active simultaneously and overlap.

Because the input as shown could be parsed in different ways, the word
nodes for slant, land, and bus all receive some activation. Slant is most
heavily activated since it most closely matches the input, but the sequence
bus land is also entertained. Presumably context and higher-level information
are used to provide the necessary input to disambiguate the situation.

In this model, we can account for filling-in effects in terms of top-down
activations of phonemes at particular locations in the trace. One important
advantage of the TRACE model is that a number of word tokens partially
consistent with a stretch of the trace and each weakly activating a particular
phoneme could conspire together to fill in a particular phoneme. Thus if the
model heard fluggy, words which begin with flu_ such as fluster and flunk
would activate phoneme nodes for /f/, /l/, and /e/ in the first part of the
trace, and words which end with _uggy such as buggy and muggy would activate
nodes for /g/ and /i/ in the latter part of the trace. In this way the model
could be made to account easily for filling-in effects in pseudoword as well
as word perception.
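
The "conspiracy" of partially matching words can be sketched as a simple
voting scheme. This is an illustration under strong assumptions, not the TRACE
mechanism: spelling stands in for phonemic form, words are aligned only with
the start or the end of the input, and `fill_in` and the four-word lexicon are
hypothetical.

```python
from collections import defaultdict


def fill_in(pattern, lexicon, gap="?"):
    """Fill gaps in a pseudoword using weak support from partial matches."""
    votes = [defaultdict(float) for _ in pattern]
    for word in lexicon:
        # Align the word with the start and with the end of the pattern.
        for shift in {0, len(pattern) - len(word)}:
            pairs = [(i + shift, ch) for i, ch in enumerate(word)
                     if 0 <= i + shift < len(pattern)]
            known = [(pos, ch) for pos, ch in pairs if pattern[pos] != gap]
            if not known:
                continue
            # Proportion of heard segments consistent with this word.
            match = sum(pattern[pos] == ch for pos, ch in known) / len(known)
            for pos, ch in pairs:
                votes[pos][ch] += match     # a partial match gives weak support
    filled = []
    for pos, ch in enumerate(pattern):
        if ch == gap and votes[pos]:
            ch = max(votes[pos], key=votes[pos].get)
        filled.append(ch)
    return "".join(filled)


lexicon = ["fluster", "flunk", "buggy", "muggy"]
print(fill_in("flu?gy", lexicon))
```

No single word in this toy lexicon matches the input, but buggy and muggy
together give the missing segment more support than fluster or flunk can give
their own candidates.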

This mechanism for using the lexicon to perceive non-words is intriguing,
because it suggests that some of the knowledge which linguists have assumed is
represented by rules might be located in the lexicon instead. Consider, for
example, phonotactic knowledge. Every language has certain sequences of sounds
which are permissible and others which are not. English has no word blik, but
it might, whereas most speakers of English would reject bnik as being
unacceptable. One might choose to conclude, therefore, that speakers have
rules of the form

    *#bn

(where the asterisk denotes ungrammaticality, and # indicates word beginning),
or more generally

    *#[stop][nasal]

But in fact, TRACE points to an alternative account for this behavior. If
perception of both words and nonwords is mediated by the lexicon, then to the
extent that a sequence of phonemes in a nonword occurs in real words, TRACE
will be able to sustain the pattern in the phoneme traces. If a sequence does
not exist, the pattern will still be present in the trace, but only by virtue
of bottom-up input, and weakly. TRACE predicts that phonotactic knowledge may
not be hard-and-fast in the fashion that rule-governed behavior should be.
Because there are some sequences which are uncommon, but which do occur in
English (e.g., initial sf clusters), listeners should be able to judge certain
nonwords as more acceptable than others; and this is in fact what happens
(Greenberg & Jenkins, 1964).

Another advantage to TRACE is that early portions of words would still be
present in the trace and so would remain available for consideration and
modification. Ambiguous early portions of a word could be filled in
retrospectively once subsequent portions correctly specified the word. This
would explain listeners' tendencies to hear an [h] in the phrase _eel of the
shoe (Warren & Sherman,



The TRACE model permits more ready extension of the interactive activation
approach to the perception of multi-word input. One can imagine the
difficulties which would be presented in COHORT given input which could be
parsed either as a single word, or several smaller words. Consider, for
example, what would happen if the system heard a string which could be
interpreted either as sell your light or cellulite. Assume that later input
will disambiguate the parsing, and that for the time being we wish to keep
both possibilities active. Because words compete strongly with one another in
COHORT, the nodes for sell, your, light, and cellulite will all be in active
competition with one another. The system will have no way of knowing that the
competition is really only between the first three of these words, as a group,
and the last. In TRACE, words still compete, but the competition can be
directed toward the portion of the input they are attempting to account for.
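
One way to picture competition that is directed at a portion of the input is
to let inhibition between two word hypotheses scale with the number of trace
slices they both claim. The sketch below assumes this; the slice boundaries,
the `Hypothesis` class, and the `strength` value are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    word: str
    start: int   # first trace slice the word claims
    end: int     # one past the last slice it claims


def overlap(a, b):
    """Number of trace slices claimed by both hypotheses."""
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def inhibition(target, hypotheses, strength=0.1):
    """Total inhibition on target from every other hypothesis it overlaps."""
    return sum(strength * overlap(target, other)
               for other in hypotheses if other is not target)


sell = Hypothesis("sell", 0, 3)
your = Hypothesis("your", 3, 5)
light = Hypothesis("light", 5, 8)
cellulite = Hypothesis("cellulite", 0, 8)
candidates = [sell, your, light, cellulite]

for h in candidates:
    print(h.word, inhibition(h, candidates))
```

Under this assumption sell, your, and light do not inhibit one another at all,
since they account for different stretches of the trace, while each of them
competes with cellulite over the stretch they share.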
CONCLUSIONS



That speech perception is a complex behavior is a claim which is hardly novel
to us. What we hope to have accomplished here is to have shed some light on
exactly what it is about speech perception which makes it such a difficult
task to model, and to have shown why interactive activation models are such an
appropriate framework for speech perception. Our basic premise is that
attempts to model this area of human behavior have been seriously hampered by
the lack of an adequate computational framework.

During the course of an utterance a large number of factors interact and
shape the speech stream. While there may be some acoustic invariance in the
signal, such invariance seems to be atypical and limited. It seems clear that
attempting to untangle these interactions within human information processing
frameworks which resemble von Neumann machines is a formidable task. Those
computer-based systems which have had any success, such as HARPY, have
achieved real-time performance at the expense of flexibility and
extensibility, and within a tightly constrained syntactic and lexical domain.
We do not wish to downplay the importance of such systems. There are certainly
many applications where they are very useful, and by illustrating how far the
so-called "engineering" approach can be pushed they serve an important
theoretical function as well.

However, we do not believe that the approach inherent in such systems will
ever lead to a speech understanding system which performs nearly as well as
humans, at anywhere near the rates we are accustomed to perceiving speech.
There is a fundamental flaw in the assumption that speech perception is
carried out in a processor which looks at all like a digital computer.
Instead, a more adequate model of speech perception assumes that perception is
carried out over a large number of neuron-like processing elements in which
there are extensive interactions. Such a model makes sense in terms of
theoretical psychology; we would argue that it will ultimately prove to be
superior in practical terms as well.

In this chapter we have described the computer simulation of one version
(COHORT) of an interactive activation model of speech perception. This model
reproduces several phenomena which we know occur in human speech perception.
It provides an account for how knowledge can be accessed in parallel, and how
a large number of knowledge elements in a system can interact. It suggests one
method by which some aspects of the encoding due to coarticulation might be
decoded. And it demonstrates the paradoxical feat of extracting segments from
the speech stream without ever doing segmentation.

COHORT has a number of defects. We have presented an outline of another
model, TRACE, which attempts to correct some of these defects. TRACE shows
that it is possible to integrate a dynamic working memory into an interactive
activation model, and that this not only provides a means for perceiving
nonwords but also shows that certain types of knowledge can be stored in the
lexicon in a way which leads to what looks like rule-governed behavior.

What we have said so far about TRACE is only its beginning. For one thing,
the process by which acoustic/phonetic features are extracted from the signal
remains a challenging task for the future. And we have yet to specify how the
knowledge above the word level should come into play. One can imagine schemas
which correspond to phrases, and which have complex structures somewhat like
words, but there are doubtless many possibilities to explore.

It is clear that a working model of speech perception which functions
anywhere nearly as well as humans do is a long way off. We do not claim that
any of the versions we present here are the right ones, but we are encouraged
by the limited success of COHORT and the potential we see in TRACE. The basic
approach is promising.